S$^3$Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

Abstract

Attention-based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes vanilla Attention-based models hard to apply to long sequence tasks. Various improved Attention structures have been proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer the sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational cost. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S$^3$Attention, which significantly improves upon previous attempts to negotiate this trade-off. S$^3$Attention has two mechanisms to effectively minimize the impact of noise while keeping complexity linear in the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S$^3$Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting datasets show that S$^3$Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
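To make the idea of sketching a matrix by selecting columns and rows concrete, the following is a minimal NumPy sketch of a generic skeleton (CUR-style) approximation. It is an illustration under stated assumptions, not the S$^3$Attention algorithm itself: the function name `skeleton_approx`, the uniform random index selection, and the toy matrix sizes are all hypothetical, and the smoothing block is not modeled.

```python
import numpy as np

def skeleton_approx(A, row_idx, col_idx, rcond=1e-10):
    """Approximate A from a subset of its rows and columns (CUR-style skeleton).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    C = A[:, col_idx]                    # selected columns, shape (n, c)
    R = A[row_idx, :]                    # selected rows, shape (r, m)
    W = A[np.ix_(row_idx, col_idx)]      # intersection block, shape (r, c)
    # The pseudo-inverse of the intersection block links the columns to the rows.
    return C @ np.linalg.pinv(W, rcond=rcond) @ R

# Toy input: a matrix with exact rank 4 (an assumption made for this example).
rng = np.random.default_rng(0)
n, m, rank, k = 200, 64, 4, 8
A = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, m))

# Uniform random selection of k rows and k columns (assumed selection rule).
row_idx = rng.choice(n, size=k, replace=False)
col_idx = rng.choice(m, size=k, replace=False)

A_hat = skeleton_approx(A, row_idx, col_idx)
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.2e}")
```

When the intersection block has the same rank as the full matrix, this reconstruction is exact up to floating-point error, so only the selected rows and columns need to be stored and multiplied. On real, noisy inputs the approximation quality depends on which rows and columns are selected.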
